Extended Abstract


The modern incarnation of neural networks, now popularly known as Deep Learning (DL), has achieved
record-breaking success in processing diverse kinds of signals: vision, audio, and text. In
parallel, strong interest has arisen in constructing a theory of DL. This paper opens up a
group-theoretic approach to a theoretical understanding of DL, in particular its unsupervised
variant. First we establish how a single layer of unsupervised pre-training can be explained in
the light of the orbit-stabilizer principle, and then we sketch how the same principle can be
extended to multiple layers.

We focus on two key principles that (amongst others) influenced the modern DL resurgence. 


(P1) Geoff Hinton summed this up as follows: “In order to do computer vision, first learn how
to do computer graphics” (Hinton, 2007). In other words, if a network learns a good
generative model of its training set, then it could use the same model for classification.


(P2) Instead of learning an entire network all at once, learn it one layer at a time. 


In each round, the layer being trained is connected to a temporary output layer and trained to learn
the weights needed to reproduce its input (i.e., to solve P1). This step, executed layer-wise,
starting with the first hidden layer and sequentially moving deeper, is often referred to as
pre-training (see Hinton et al. (2006); Hinton (2007); Salakhutdinov & Hinton (2009); Bengio et al.
(in preparation)), and the resulting layer is called an autoencoder. Figure 1(a) shows a schematic
autoencoder. Its weight set $W_1$ is learnt by the network. Subsequently, when presented with an
input $f$, the network will produce an output $f' \approx f$. At this point the output units as well
as the weight set $W_2$ are discarded.

There is an alternate characterization of P1. An autoencoder unit, such as the above, maps an input
space to itself. Moreover, after learning, it is, by definition, a stabilizer¹ of the input $f$. Now,
input signals are often decomposable into features, and an autoencoder attempts to find a succinct
set of features that all inputs can be decomposed into. Satisfying P1 means that the learned
configurations can reproduce these features. Figure 1(b) illustrates this post-training behavior. If
the hidden units learned features $f_1, f_2, \ldots$ and one of them, say $f_i$, comes back as input,
the output must be $f_i$. In other words, learning a feature is equivalent to searching for a
transformation that stabilizes it.
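
The stabilizer view of a learned feature can be illustrated with a small sketch. It assumes a linear autoencoder with tied weights whose hidden units hold orthonormal feature vectors; this toy setup, and the names `encode`, `decode`, `autoencode` and `close`, are illustrative assumptions rather than the construction used in the paper. Under these assumptions the encode-decode map returns each learned feature unchanged, while a generic input is altered.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def encode(W, f):
    """Hidden activations: one dot product per hidden unit (row of W)."""
    return [sum(w * c for w, c in zip(row, f)) for row in W]

def decode(W, h):
    """Tied-weight decoder (an assumption): reconstruct the input as W^T h."""
    return [sum(W[k][j] * h[k] for k in range(len(W))) for j in range(len(W[0]))]

def autoencode(W, f):
    """The transformation T: input space -> input space."""
    return decode(W, encode(W, f))

def close(u, v, tol=1e-9):
    """Component-wise comparison up to floating-point tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(u, v))

# Two hypothetical learned features in R^4, chosen to be orthonormal.
f1 = normalize([1, 1, 0, 0])
f2 = normalize([0, 0, 1, -1])
W = [f1, f2]

# A generic input that is not one of the learned features.
generic = [1.0, 0.0, 0.0, 0.0]

print(close(autoencode(W, f1), f1))            # True: T stabilizes f1
print(close(autoencode(W, f2), f2))            # True: T stabilizes f2
print(close(autoencode(W, generic), generic))  # False: generic input is changed
```

Here $T = W^\top W$ is a projection onto the span of the features, so every learned feature satisfies $T(f_i) = f_i$, which is exactly the stabilizer condition of footnote 1.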

The idea of stabilizers invites an analogy reminiscent of the orbit-stabilizer relationship studied in
the theory of group actions. Suppose $G$ is a group that acts on a set $X$ by moving its points around
(e.g., groups of $2 \times 2$ invertible matrices acting on the Euclidean plane). Consider $x \in X$,
and let $O_x$ be the set of all points reachable from $x$ via the group action. $O_x$ is called an
orbit². A subset of the group elements may leave $x$ unchanged. This subset $S_x$ (which is also a
subgroup) is the stabilizer of $x$. If it is possible to define a notion of volume for a group, then
there is an inverse relationship


*This research was supported in part by the NSF under grant BIGDATA-1251049.
1 A transformation $T$ is called a stabilizer of an input $f$ if $f' = T(f) = f$.

2 The orbit $O_x$ of an element $x \in X$ under the action of a group $G$ is defined as the set $O_x = \{g(x) \in X \mid g \in G\}$.
Figure 1: (a) $W_1$ is preserved, $W_2$ discarded. (b) Post-learning, each feature is stabilized.
(c) Alternate ways of decomposing a signal into simpler features. The neurons could potentially
learn the features in the top row or those in the bottom row. Almost surely, the simpler ones
(bottom row) are learned.


between the volumes of $S_x$ and $O_x$, which holds even if $x$ is actually a subset (as opposed to
being a point). For example, for finite groups, the product of $|O_x|$ and $|S_x|$ is the order of
the group.
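
The finite-group case can be checked by brute force. The sketch below picks, as an illustrative assumption, the symmetric group on four points acting by permutation, and verifies that orbit size times stabilizer size equals the group order, both for a single point and for a two-element subset. The helper names `act`, `orbit` and `stabilizer` are hypothetical.

```python
from itertools import permutations

def act(g, x):
    """Apply permutation g (a tuple) to a point, or element-wise to a frozenset."""
    if isinstance(x, frozenset):
        return frozenset(g[i] for i in x)
    return g[x]

def orbit(group, x):
    """All images of x under the group action."""
    return {act(g, x) for g in group}

def stabilizer(group, x):
    """All group elements that leave x unchanged."""
    return [g for g in group if act(g, x) == x]

# The symmetric group S_4: all 24 permutations of {0, 1, 2, 3}.
G = list(permutations(range(4)))

point = 0
subset = frozenset({0, 1})

for x in (point, subset):
    o, s = orbit(G, x), stabilizer(G, x)
    # Orbit-stabilizer relation for finite groups: |O_x| * |S_x| = |G|.
    print(len(o), len(s), len(o) * len(s) == len(G))
```

For the point the orbit has 4 elements and the stabilizer 6; for the subset the orbit has 6 elements and the stabilizer 4. In both cases the product is 24, the order of the group.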

The inverse relationship between the volumes of orbits and stabilizers takes on a central role as we
connect this back to DL. There are many possible ways to decompose signals into smaller features.
Figure 1(c) illustrates this point: a rectangle can be decomposed into L-shaped features or
straight-line edges.


All experiments to date suggest that a neural network is likely to learn the edges. But why? To
answer this, imagine that the space of autoencoders (viewed as transformations of the input)
forms a group. A batch of learning iterations stops whenever a stabilizer is found. Roughly speaking,
if the search is a Markov chain (or a guided chain such as MCMC), then the bigger a stabilizer,
the earlier it will be hit. The group structure implies that this big stabilizer corresponds to a small
orbit. Now intuition suggests that the simpler a feature, the smaller its orbit. For example, a
line segment generates far fewer possible shapes under linear deformations than a flower-like
shape. An autoencoder should then learn these simpler features first, which falls in line with most
experiments (see Lee et al. (2009)).
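
A toy computation makes the edge versus L-shape comparison concrete. The sketch below replaces the group of linear deformations with the eight symmetries of a square acting on a 3 x 3 pixel grid, and replaces the Markov chain search with independent uniform draws from the group; both substitutions are simplifying assumptions made only for illustration, and the names `apply`, `act`, `edge` and `l_shape` are hypothetical. With uniform draws, the expected number of draws until a stabilizer element appears is $|G| / |S_x| = |O_x|$.

```python
def d4_elements():
    """The 8 symmetries of a square, each labelled (quarter_turns, flipped)."""
    return [(k, flip) for k in range(4) for flip in (False, True)]

def apply(g, cell, n=3):
    """Move one cell (row, col) of an n x n grid: optional mirror, then k quarter turns."""
    k, flip = g
    r, c = cell
    if flip:
        c = n - 1 - c
    for _ in range(k):
        r, c = c, n - 1 - r
    return (r, c)

def act(g, pattern):
    """Image of a pattern (a frozenset of cells) under the symmetry g."""
    return frozenset(apply(g, cell) for cell in pattern)

def orbit(group, pattern):
    return {act(g, pattern) for g in group}

def stabilizer(group, pattern):
    return [g for g in group if act(g, pattern) == pattern]

G = d4_elements()

# A straight edge through the centre, and an L-shaped feature.
edge = frozenset({(1, 0), (1, 1), (1, 2)})
l_shape = frozenset({(0, 0), (1, 0), (2, 0), (2, 1)})

for name, feature in (("edge", edge), ("L-shape", l_shape)):
    o, s = orbit(G, feature), stabilizer(G, feature)
    # Mean of a geometric distribution with success probability |S_x| / |G|.
    expected_draws = len(G) / len(s)
    print(name, len(o), len(s), expected_draws)
```

In this toy setting the edge has an orbit of size 2 and a stabilizer of size 4, while the L-shape has an orbit of size 8 and a trivial stabilizer, so a uniform search is expected to stabilize the edge after 2 draws and the L-shape only after 8.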


The intuition naturally extends to a many-layer scenario. Each hidden layer finds a feature with a
big stabilizer. But beyond the first level, the inputs no longer inhabit the same space as the training
samples. A “simple” feature over this new space actually corresponds to a more complex shape in 
the space of input samples. This process repeats as the number of layers increases. In effect, each 
layer learns “edge-like features” with respect to the previous layer, and from these locally simple 
representations we obtain the learned higher-order representation. 